Lego Analysis

Today we’ll dive deep into a dataset all about LEGO. From the dataset we can ask whole bunch of interesting questions about the history of the LEGO company, their product offering, and which LEGO set ultimately rules them all:

Rebrickable has compiled data on all the LEGO pieces in existence. I recommend you use download the .csv files provided in this lesson.

Challenge: How many different colours does the LEGO company produce? Read the colors.csv file in the data folder and find the total number of unique colours. Try using the .nunique() method to accomplish this.

Code
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Markdown

colors = pd.read_csv("data/colors.csv")

colors.name.nunique()
135

Challenge: Find the number of transparent colours where is_trans == 't' versus the number of opaque colours where is_trans == 'f'. See if you can accomplish this in two different ways.

Alternative 1:

Code
colors.groupby("is_trans").name.nunique()
is_trans
f    107
t     28
Name: name, dtype: int64

Alternative 2:

Code
print(f'The number of opaque colors is :{colors[colors.is_trans == "f"].name.nunique()}')
print(f'The number of transparent colors is :{colors[colors.is_trans == "t"].name.nunique()}')
The number of opaque colors is :107
The number of transparent colors is :28

Understanding LEGO Themes vs. LEGO Sets

Walk into a LEGO store and you will see their products organised by theme. Their themes include Star Wars, Batman, Harry Potter and many more.

A lego set is a particular box of LEGO or product. Therefore, a single theme typically has many different sets.

The sets.csv data contains a list of sets over the years and the number of parts that each of these sets contained.

Challenge: Read the sets.csv data and take a look at the first and last couple of rows.

Code
sets = pd.read_csv("data/sets.csv")
sets.head(3)
sets.tail(3)
set_num name year theme_id num_parts
15707 XWING-1 Mini X-Wing Fighter 2019 158 60
15708 XWING-2 X-Wing Trench Run 2019 158 52
15709 YODACHRON-1 Yoda Chronicles Promotional Set 2013 158 413

Challenge: In which year were the first LEGO sets released and what were these sets called?

The set Extra-Large Gift Set (ABB) was the earliest set, and it was released in 1949.

Challenge: How many different sets did LEGO sell in their first year? How many types of LEGO products were on offer in the year the company started?

The year the company started (1949), it sold 5 different sets that were part of 2 unique themes.

Challenge: Find the top 5 LEGO sets with the most number of parts.

Code
sets.sort_values(by = "num_parts", ascending = False).head(5)
set_num name year theme_id num_parts
15004 BIGBOX-1 The Ultimate Battle for Chima 2015 571 9987
11183 75192-1 UCS Millennium Falcon 2017 171 7541
10551 71043-1 Hogwarts Castle 2018 246 6020
295 10256-1 Taj Mahal 2017 673 5923
221 10189-1 Taj Mahal 2008 673 5922

Challenge: Use .groupby() and .count() to show the number of LEGO sets released year-on-year. How do the number of sets released in 1955 compare to the number of sets released in 2019?

The number of sets released in 2019 was 840, 30.0 times more than the number of sets released in 1955 (28).

Challenge: Show the number of LEGO releases on a line chart using Matplotlib.

Note that the .csv file is from late 2020, so to plot the full calendar years, you will have to exclude some data from your chart. Can you use the slicing techniques covered in Day 21 to avoid plotting the last two years? The same syntax will work on Pandas DataFrames.

Code
yearly_releases_to_plot = yearly_releases[yearly_releases.index < 2020]

plt.xlabel("Year")
plt.ylabel("Number of sets released")
plt.plot(yearly_releases_to_plot.index.to_numpy(), yearly_releases_to_plot.name.to_numpy())
plt.show()

Number of Lego sets released by year

Aggregate Data with the Python .agg() Function

Let's work out the number of different themes shipped by year. This means we have to count the number of unique theme_ids per calendar year.

Challenge: Plot the number of themes released by year on a line chart. Only include the full calendar years (i.e., exclude 2020 and 2021).

Code
themes_released = sets[sets.year < 2020].groupby("year").theme_id.nunique()

plt.xlabel("Year")
plt.ylabel("Number of themes released")
plt.plot(themes_released.index.to_numpy(), themes_released.to_numpy())
plt.show()

Number of Lego themes released by year

Challenge: Use the .groupby() and .agg() function together to figure out the average number of parts per set. How many parts did the average LEGO set released in 1954 compared to say, 2017?

In 1954 a LEGO set would have 13.3 parts in average, merely a 5.8% of the 229.4 parts that a set had in average in 2017.

Challenge: Has the size and complexity of LEGO sets increased over time based on the number of parts? Plot the average number of parts over time using a Matplotlib scatter plot. See if you can use the scatter plot documentation before I show you the solution. Do you spot a trend in the chart?

Code
average_parts_to_plot = average_parts[average_parts.index < 2020]

plt.xlabel("Year")
plt.ylabel("Average number of parts per set")

plt.scatter(average_parts_to_plot.index, average_parts_to_plot.average_parts)
plt.show()

Evolution of the average number of parts

LEGO has licensed many hit franchises from Harry Potter to Marvel Super Heros to many others. But which theme has the largest number of individual sets?

Challenge Use what you know about HTML markup and tags to display the database schema: https://i.imgur.com/Sg4lcjx.png

The themes.csv file has the actual theme names. The sets.csv has theme_ids which link to the id column in the themes.csv.

Challenge: Explore the themes.csv. How is it structured? Search for the name 'Star Wars'. How many ids correspond to this name in the themes.csv? Now use these ids and find the corresponding the sets in the sets.csv (Hint: you'll need to look for matches in the theme_id column)

Code
themes = pd.read_csv("data/themes.csv")
star_wars_ids = themes[themes.name.str.contains("star wars", case = False)].id.unique().tolist()

sets[sets.theme_id.isin(star_wars_ids)].name
850                    Star Wars: Build Your Own Adventure
855      Parts for Star Wars Build Your Own Adventure: ...
1717                               Clone Turbo Tank - Mini
1728                        Republic Attack Cruiser - Mini
1738                                   AT-TE Walker - Mini
                               ...                        
15686              Star Wars Co-Pack of 7101 7111 and 7171
15689                   Star Wars Co-Pack of 7130 and 7150
15707                                  Mini X-Wing Fighter
15708                                    X-Wing Trench Run
15709                      Yoda Chronicles Promotional Set
Name: name, Length: 776, dtype: object

Merging

Code
theme_count = sets.groupby("theme_id").name.count().rename("set_count", inplace = True)

merged_star_wars = pd.merge(theme_count, themes, left_on = "theme_id", right_on = "id").sort_values(by = "set_count", ascending = False)

plt.ylabel("Number of sets")
plt.gca().invert_yaxis()
plt.barh(merged_star_wars.name[:10], merged_star_wars.set_count[:10])
plt.show()